-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Diskann Benchmarking Wrapper #260
base: branch-25.02
Are you sure you want to change the base?
Conversation
…diskann-wrapper
…diskann-wrapper
…diskann-wrapper
…into diskann-wrapper
…into diskann-wrapper
…diskann-wrapper
…diskann-wrapper
…diskann-wrapper
@@ -32,6 +32,12 @@ option(CUVS_ANN_BENCH_USE_CUVS_BRUTE_FORCE "Include cuVS brute force knn in benc | |||
option(CUVS_ANN_BENCH_USE_CUVS_CAGRA_HNSWLIB "Include cuVS CAGRA with HNSW search in benchmark" ON) | |||
option(CUVS_ANN_BENCH_USE_HNSWLIB "Include hnsw algorithm in benchmark" ON) | |||
option(CUVS_ANN_BENCH_USE_GGNN "Include ggnn algorithm in benchmark" OFF) | |||
option(CUVS_ANN_BENCH_USE_DISKANN "Include DISKANN search in benchmark" ON) | |||
option(CUVS_ANN_BENCH_USE_CUVS_VAMANA "Include cuVS Vamana with DiskANN search in benchmark" ON) | |||
if(CMAKE_SYSTEM_PROCESSOR MATCHES "(ARM|arm|aarch64)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does MSFT DiskANN repo not support ARM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they have mkl-devel
as a dependency, which is not meant to be installed in aarch64.
template <typename T> | ||
void diskann_ssd<T>::build(const T* dataset, size_t nrow) | ||
{ | ||
diskann::build_disk_index<float>(base_file_.c_str(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this defined? Is it using the CPU DiskANN repo to build the ssd verson of the index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is right. It is just to benchmark the CPU build
/ok to test |
/ok to test |
…into diskann-wrapper
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid algorithm-specific changes to cpp/bench/ann/src/common/benchmark.hpp
as per our offline discussion.
I'm not opposed to exposing dataset/index filepaths to algorithms in general though.
But maybe it would make sense to hide (filter out) them when dumping all build parameters to avoid cluttering the console/reports and exposing private user filepaths?
@achirkin are you suggesting that we expose the dataset / index filepaths to all algorithms, but only use them for diskann ssd indexes? |
@@ -144,7 +150,8 @@ void bench_build(::benchmark::State& state, | |||
|
|||
const auto algo_property = parse_algo_property(algo->get_preference(), index.build_param); | |||
|
|||
const T* base_set = dataset->base_set(algo_property.dataset_memory_type); | |||
const T* base_set = nullptr; | |||
if (index.algo != "diskann_ssd") base_set = dataset->base_set(algo_property.dataset_memory_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@achirkin if we do not have this line, the entire dataset will be read into the base_set
variable needlessly for DiskANN ssd based indexes. For these indexes, only the path to the dataset needs to be known and the DiskANN index building APIs will read the data from the path. In fact, it is important we do it this way because those SSD indexes are designed to let the data remain on disk if there are extremely large datasets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just set the dataset_memory_type
to kHostMmap
by default for diskann index. It will lazily map the file content, so it won't read anything until you actually use the pointer.
It will still open the file and read the first two words anyway - independently of this change. That is because the benchmark reads and reports the dataset dimensionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to always adhere to these two rules:
- No algorithm-specific code in the common header files, such as
benchmark.hpp
- All configuration-specific code should go into
conf.hpp
, not intobenchmark.hpp
.
@@ -144,7 +150,8 @@ void bench_build(::benchmark::State& state, | |||
|
|||
const auto algo_property = parse_algo_property(algo->get_preference(), index.build_param); | |||
|
|||
const T* base_set = dataset->base_set(algo_property.dataset_memory_type); | |||
const T* base_set = nullptr; | |||
if (index.algo != "diskann_ssd") base_set = dataset->base_set(algo_property.dataset_memory_type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just set the dataset_memory_type
to kHostMmap
by default for diskann index. It will lazily map the file content, so it won't read anything until you actually use the pointer.
It will still open the file and read the first two words anyway - independently of this change. That is because the benchmark reads and reports the dataset dimensionality.
if (index.algo == "diskann_ssd") { | ||
make_sure_parent_dir_exists(index.file); | ||
index.build_param["dataset_file"] = dataset->base_filename(); | ||
index.build_param["path_to_index"] = index.file; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you shouldn't need path_to_index
parameters, because the index member file
is accessible from within the index in your algorithm-specific code.
If you need the dataset file path during index build, you can either ingest it as a build param in the conf.hpp
, similar to how we inject some search parameters in parse_index(...), or create one more member in the configuration::index
- all in conf.hpp
; I think it's ok to have it available to all algorithm at that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, the cuvs::bench::configuration::index
object's attributes are not directly passed down to the specific algorithm's wrapper class. So we would need the path_to_index
. The algorithm specific structs use only the parameters that are explicitly passed down from the configuration object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question is: CAN we pass the cuvs::bench::configuration::Index
down into the wrappers? Or rather, any reason we shouldn't do it?
if (index.algo != "diskann_ssd") | ||
filename = index.file; | ||
else | ||
filename = index.file + "_disk.index"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same problem, the algorithm-specific code should go to the algorithm headers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just for checking if the index file exists. The diskann ssd index is saved with the suffix _disk.index
along with the filename.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is algorithm-specific code in common benchmarking code, though. This should be moved into the diskann portions of the code. You have to understand that we create generalized abstractions for this very purpose- anything in common/benchmark.hpp
should be applied to all index algorithms and any specific things should be propagated down to the respective index types. Polluting the common code like this moves up the abstraction tree, rather than down, and it results in hard to maintain spaghetti code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another reason to propagate this down to the individual wrappers. You could at least specify a method on the wrapper class to allow any algorithms to override and adjust the filename as needed. That would be more robust than one-offs like this.
…into diskann-wrapper
…diskann-wrapper
/ok to test |
Brings DiskANN into cuvs-bench